hamming_loss (bitwise error rate for multilabel classification)#

hamming_loss measures the fraction of label predictions that are wrong.

  • For standard (single-label) classification it reduces to the misclassification rate.

  • For multilabel classification it averages mistakes across the (sample, label) grid — how many bits did we flip?

Learning goals#

  • write the multiclass and multilabel formulas (with clear notation)

  • build intuition with plots (what counts as an error)

  • implement Hamming loss from scratch in NumPy (including sample_weight)

  • see how Hamming loss interacts with probability thresholds in multilabel logistic regression

  • know pros/cons and when to prefer other metrics

Quick import#

from sklearn.metrics import hamming_loss

Table of contents#

  1. Definitions and notation

  2. Intuition (plots)

  3. NumPy implementation + sanity checks

  4. Using Hamming loss for threshold tuning (multilabel logistic regression)

  5. Pros, cons, pitfalls

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.metrics import hamming_loss as sk_hamming_loss
from sklearn.model_selection import train_test_split

pio.templates.default = 'plotly_white'
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.random.seed(0)
np.set_printoptions(precision=3, suppress=True)

1) Definitions and notation#

Assume we have \(n\) samples.

Single-label classification (binary or multiclass)#

  • True label: \(y_i \in \{0,1,\dots,K-1\}\)

  • Predicted label: \(\hat{y}_i \in \{0,1,\dots,K-1\}\)

The Hamming loss is the fraction of samples whose predicted label is wrong:

\[ \operatorname{HL}(y,\hat{y}) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[y_i \neq \hat{y}_i] \]

For single-label classification this is exactly the misclassification rate (a.k.a. zero_one_loss).
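A quick sanity check of this equivalence, using made-up toy labels: on integer multiclass labels, hamming_loss and zero_one_loss return the same number.

```python
import numpy as np
from sklearn.metrics import hamming_loss, zero_one_loss

# Hypothetical multiclass labels: 2 of 5 predictions are wrong.
y_true = np.array([0, 1, 2, 1, 0])
y_pred = np.array([0, 2, 2, 1, 1])

print(hamming_loss(y_true, y_pred))   # 0.4
print(zero_one_loss(y_true, y_pred))  # 0.4
```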

Multilabel classification (label indicator matrix)#

  • True labels: \(Y \in \{0,1\}^{n\times L}\) (each row can have multiple 1s)

  • Predictions: \(\hat{Y} \in \{0,1\}^{n\times L}\)

Hamming loss counts mismatches over all (sample, label) decisions:

\[ \operatorname{HL}(Y,\hat{Y}) = \frac{1}{nL}\sum_{i=1}^n\sum_{\ell=1}^L \mathbf{1}[Y_{i\ell} \neq \hat{Y}_{i\ell}] \]

Equivalently, it is the average Hamming distance per sample, normalized by \(L\).

Relationship to micro-accuracy#

If you treat each (sample, label) as a binary decision, then:

\[ \text{micro-accuracy} = \frac{TP + TN}{nL} \quad\Rightarrow\quad \operatorname{HL} = 1 - \text{micro-accuracy} \]
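The identity is easy to verify numerically on random indicator matrices (the shapes here are arbitrary): flatten the (sample, label) grid and compare against plain accuracy.

```python
import numpy as np
from sklearn.metrics import hamming_loss, accuracy_score

rng = np.random.default_rng(42)
Y_true = rng.integers(0, 2, size=(50, 4))
Y_pred = rng.integers(0, 2, size=(50, 4))

# Micro-accuracy = plain accuracy over the flattened (sample, label) grid.
micro_acc = accuracy_score(Y_true.ravel(), Y_pred.ravel())
print(np.isclose(hamming_loss(Y_true, Y_pred), 1.0 - micro_acc))  # True
```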

Contrast with subset accuracy (exact match)#

Subset accuracy (a.k.a. exact match ratio) for multilabel requires getting all labels correct for a sample:

\[ \text{subset-accuracy}(Y,\hat{Y}) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[Y_{i,:} = \hat{Y}_{i,:}] \]

Hamming loss is more forgiving: getting 1 label wrong out of 20 is a small penalty, while subset accuracy would count the whole sample as wrong.
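The "1 wrong label out of 20" case can be checked directly (a toy construction, not data from this notebook): Hamming loss charges 1/20 = 0.05, while exact match counts the whole sample as wrong.

```python
import numpy as np
from sklearn.metrics import hamming_loss

# One sample, 20 labels, a single flipped bit.
Y_true = np.zeros((1, 20), dtype=int)
Y_true[0, :3] = 1
Y_pred = Y_true.copy()
Y_pred[0, 0] = 0  # miss exactly one label

print(hamming_loss(Y_true, Y_pred))   # 0.05
print(bool(np.all(Y_true == Y_pred)))  # False -> subset accuracy = 0 here
```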

2) Intuition (plots)#

Think of each label decision as a bit.

  • 0 means perfect predictions.

  • 0.25 means 25% of all bits are wrong.

Below we visualize Y_true, Y_pred, and the mismatch matrix (Y_true != Y_pred).

Y_true = np.array(
    [
        [1, 0, 0, 1, 0, 1],
        [0, 1, 0, 0, 0, 0],
        [1, 1, 0, 0, 1, 0],
        [0, 0, 0, 1, 0, 0],
        [1, 0, 1, 1, 0, 0],
        [0, 1, 0, 0, 1, 1],
        [0, 0, 0, 0, 0, 0],
        [1, 1, 1, 0, 0, 0],
    ],
    dtype=int,
)

Y_pred = np.array(
    [
        [1, 0, 1, 1, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 1, 0],
        [0, 0, 0, 1, 1, 0],
        [1, 0, 0, 1, 0, 0],
        [0, 1, 0, 1, 0, 1],
        [0, 0, 0, 0, 0, 0],
        [1, 0, 1, 0, 0, 0],
    ],
    dtype=int,
)

mismatch = (Y_true != Y_pred).astype(int)

hl_manual = float(mismatch.mean())
hl_sklearn = float(sk_hamming_loss(Y_true, Y_pred))

subset_acc = float(np.mean(np.all(Y_true == Y_pred, axis=1)))

print(f'Hamming loss (manual) : {hl_manual:.3f}')
print(f'Hamming loss (sklearn): {hl_sklearn:.3f}')
print(f'Subset accuracy       : {subset_acc:.3f}')
Hamming loss (manual) : 0.188
Hamming loss (sklearn): 0.188
Subset accuracy       : 0.125
n_samples, n_labels = Y_true.shape
x_labels = [f'label_{j}' for j in range(n_labels)]
y_labels = [f'sample_{i}' for i in range(n_samples)]

fig = make_subplots(
    rows=1,
    cols=3,
    subplot_titles=['Y_true', 'Y_pred', 'Mismatch (1 = wrong)'],
)

fig.add_trace(
    go.Heatmap(
        z=Y_true,
        x=x_labels,
        y=y_labels,
        colorscale='Blues',
        zmin=0,
        zmax=1,
        showscale=False,
    ),
    row=1,
    col=1,
)

fig.add_trace(
    go.Heatmap(
        z=Y_pred,
        x=x_labels,
        y=y_labels,
        colorscale='Greens',
        zmin=0,
        zmax=1,
        showscale=False,
    ),
    row=1,
    col=2,
)

fig.add_trace(
    go.Heatmap(
        z=mismatch,
        x=x_labels,
        y=y_labels,
        colorscale=[[0, '#ffffff'], [1, '#d62728']],
        zmin=0,
        zmax=1,
        showscale=False,
    ),
    row=1,
    col=3,
)

fig.update_layout(
    title=f'Hamming loss = {hl_manual:.3f} (fraction of wrong bits)',
    height=420,
)
fig.show()
per_sample = mismatch.mean(axis=1)
per_label = mismatch.mean(axis=0)

fig1 = px.bar(
    x=[f'sample_{i}' for i in range(n_samples)],
    y=per_sample,
    title='Per-sample contribution: fraction of wrong labels',
    labels={'x': 'sample', 'y': 'wrong-label fraction'},
)
fig1.add_hline(y=hl_manual, line_dash='dash', annotation_text='global HL')
fig1.update_yaxes(range=[0, 1])
fig1.show()

fig2 = px.bar(
    x=x_labels,
    y=per_label,
    title='Per-label error rate',
    labels={'x': 'label', 'y': 'error rate'},
)
fig2.add_hline(y=hl_manual, line_dash='dash', annotation_text='global HL')
fig2.update_yaxes(range=[0, 1])
fig2.show()

A common pitfall: multiclass as one-hot vs integer labels#

For multiclass problems you often have one true class per sample.

  • If you pass integer labels (shape = (n,)), Hamming loss is the misclassification rate.

  • If you convert to one-hot (shape = (n, K)), a single wrong prediction creates two bit errors (one FN + one FP) out of \(K\) bits, so the one-hot value is \(2/K\) times the misclassification rate.

Below we compare the two representations.

y_true_mc = np.array([0, 1, 2, 2, 1, 0])
y_pred_mc = np.array([0, 2, 2, 1, 1, 0])

hl_int = float(sk_hamming_loss(y_true_mc, y_pred_mc))

K = 3
Y_true_oh = np.eye(K, dtype=int)[y_true_mc]
Y_pred_oh = np.eye(K, dtype=int)[y_pred_mc]

hl_onehot = float(sk_hamming_loss(Y_true_oh, Y_pred_oh))

print(f'Hamming loss with integer labels: {hl_int:.3f}')
print(f'Hamming loss with one-hot labels: {hl_onehot:.3f}  (note the scaling)')
Hamming loss with integer labels: 0.333
Hamming loss with one-hot labels: 0.222  (note the scaling)

3) NumPy implementation + sanity checks#

A from-scratch implementation is straightforward once you remember the definition: count mismatches and average.

sample_weight in sklearn.metrics.hamming_loss applies at the sample level:

  • compute per-sample Hamming loss (mean mismatches across labels)

  • take a weighted average across samples

def hamming_loss_np(y_true, y_pred, *, sample_weight=None) -> float:
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    if y_true.shape != y_pred.shape:
        raise ValueError(f'shape mismatch: y_true {y_true.shape} vs y_pred {y_pred.shape}')

    if y_true.ndim == 1:
        mismatches = (y_true != y_pred).astype(float)
        if sample_weight is None:
            return float(mismatches.mean())

        w = np.asarray(sample_weight, dtype=float)
        if w.shape != (y_true.shape[0],):
            raise ValueError(f'sample_weight must have shape {(y_true.shape[0],)}, got {w.shape}')
        return float(np.average(mismatches, weights=w))

    if y_true.ndim == 2:
        mismatches = (y_true != y_pred).astype(float)
        per_sample = mismatches.mean(axis=1)

        if sample_weight is None:
            return float(per_sample.mean())

        w = np.asarray(sample_weight, dtype=float)
        if w.shape != (y_true.shape[0],):
            raise ValueError(f'sample_weight must have shape {(y_true.shape[0],)}, got {w.shape}')
        return float(np.average(per_sample, weights=w))

    raise ValueError('y_true and y_pred must be 1D (single-label) or 2D (multilabel)')
# Sanity checks vs scikit-learn
rng = np.random.default_rng(0)

# 1) single-label (multiclass)
y_true_1d = rng.integers(0, 4, size=200)
y_pred_1d = rng.integers(0, 4, size=200)

print(
    '1D close?',
    np.allclose(
        hamming_loss_np(y_true_1d, y_pred_1d),
        sk_hamming_loss(y_true_1d, y_pred_1d),
    ),
)

# 2) multilabel indicator
Y_true_2d = rng.integers(0, 2, size=(120, 7))
Y_pred_2d = rng.integers(0, 2, size=(120, 7))

print(
    '2D close?',
    np.allclose(
        hamming_loss_np(Y_true_2d, Y_pred_2d),
        sk_hamming_loss(Y_true_2d, Y_pred_2d),
    ),
)

# 3) sample weights
w = rng.random(size=Y_true_2d.shape[0])

hl_np_w = hamming_loss_np(Y_true_2d, Y_pred_2d, sample_weight=w)
hl_sk_w = float(sk_hamming_loss(Y_true_2d, Y_pred_2d, sample_weight=w))

print('weighted close?', np.allclose(hl_np_w, hl_sk_w))
print('weighted value:', hl_np_w)
1D close? True
2D close? True
weighted close? True
weighted value: 0.528672497390485

4) Using Hamming loss for threshold tuning (multilabel logistic regression)#

Hamming loss is defined on hard predictions (0/1), so as a function of model parameters it is piecewise constant and not differentiable.

A common pattern is:

  1. Train a probabilistic model (e.g. multilabel logistic regression) by minimizing a differentiable surrogate (binary cross-entropy / log_loss).

  2. Convert probabilities to hard labels with a threshold \(t\).

  3. Choose \(t\) (or per-label thresholds) to minimize Hamming loss on a validation set.

Model#

For \(L\) labels, we use independent sigmoids:

\[ Z = XW + b,\quad P = \sigma(Z) \]

Prediction with a threshold \(t\):

\[ \hat{Y}_{i\ell} = \mathbf{1}[P_{i\ell} \ge t] \]

We will train with average binary cross-entropy (from logits):

\[ J(W,b) = \frac{1}{nL}\sum_{i=1}^n\sum_{\ell=1}^L \Big(\operatorname{softplus}(Z_{i\ell}) - Y_{i\ell} Z_{i\ell}\Big) \]

Then we will tune \(t\) to minimize Hamming loss.

def sigmoid(z):
    # Note: np.exp(-z) can overflow for very negative z; NumPy emits a
    # warning, but the result correctly saturates to 0.
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))


def softplus(z):
    # Stable softplus: log(1 + exp(z))
    z = np.asarray(z, dtype=float)
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)


def bce_from_logits(Y, Z) -> float:
    Y = np.asarray(Y, dtype=float)
    Z = np.asarray(Z, dtype=float)
    return float(np.mean(softplus(Z) - Y * Z))


def standardize_fit_transform(X):
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std, mean, std


def standardize_transform(X, mean, std):
    X = np.asarray(X, dtype=float)
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std


def fit_multilabel_logreg_gd(
    X_train,
    Y_train,
    X_val=None,
    Y_val=None,
    *,
    lr=0.8,
    n_steps=400,
    l2=0.0,
    threshold=0.5,
):
    X_train = np.asarray(X_train, dtype=float)
    Y_train = np.asarray(Y_train, dtype=float)
    n_samples, n_features = X_train.shape
    n_labels = Y_train.shape[1]

    W = np.zeros((n_features, n_labels))
    b = np.zeros(n_labels)

    history = {
        'step': [],
        'train_bce': [],
        'train_hl': [],
        'val_bce': [],
        'val_hl': [],
    }

    for step in range(n_steps):
        Z = X_train @ W + b
        P = sigmoid(Z)

        train_bce = bce_from_logits(Y_train, Z)

        # dJ/dZ = (P - Y) / (n_samples * n_labels) when J is the mean over all entries
        G = (P - Y_train) / (n_samples * n_labels)
        grad_W = X_train.T @ G + l2 * W
        grad_b = G.sum(axis=0)

        W -= lr * grad_W
        b -= lr * grad_b

        Y_hat = (P >= threshold).astype(int)
        train_hl = hamming_loss_np(Y_train.astype(int), Y_hat)

        history['step'].append(step)
        history['train_bce'].append(train_bce)
        history['train_hl'].append(train_hl)

        if X_val is not None and Y_val is not None:
            Z_val = X_val @ W + b
            P_val = sigmoid(Z_val)
            val_bce = bce_from_logits(Y_val, Z_val)
            val_hl = hamming_loss_np(Y_val.astype(int), (P_val >= threshold).astype(int))

            history['val_bce'].append(val_bce)
            history['val_hl'].append(val_hl)
        else:
            history['val_bce'].append(None)
            history['val_hl'].append(None)

    return W, b, history
# Synthetic multilabel dataset
rng = np.random.default_rng(1)

n_samples = 1600
n_features = 8
n_labels = 6

X = rng.normal(size=(n_samples, n_features))

W_true = rng.normal(scale=1.2, size=(n_features, n_labels))
# Make some labels rarer than others by shifting biases
b_true = np.linspace(-2.0, 0.5, n_labels)

Z_true = X @ W_true + b_true
P_true = sigmoid(Z_true)
Y = (rng.random(size=P_true.shape) < P_true).astype(int)

X_train, X_val, Y_train, Y_val = train_test_split(
    X,
    Y,
    test_size=0.3,
    random_state=0,
)

X_train_s, mean, std = standardize_fit_transform(X_train)
X_val_s = standardize_transform(X_val, mean, std)

W, b, hist = fit_multilabel_logreg_gd(
    X_train_s,
    Y_train,
    X_val=X_val_s,
    Y_val=Y_val,
    lr=0.9,
    n_steps=300,
    l2=0.0,
    threshold=0.5,
)

fig = make_subplots(specs=[[{'secondary_y': True}]])
fig.add_trace(go.Scatter(x=hist['step'], y=hist['train_bce'], name='train BCE'), secondary_y=False)
fig.add_trace(go.Scatter(x=hist['step'], y=hist['val_bce'], name='val BCE'), secondary_y=False)
fig.add_trace(go.Scatter(x=hist['step'], y=hist['train_hl'], name='train Hamming loss'), secondary_y=True)
fig.add_trace(go.Scatter(x=hist['step'], y=hist['val_hl'], name='val Hamming loss'), secondary_y=True)

fig.update_xaxes(title_text='gradient descent step')
fig.update_yaxes(title_text='binary cross-entropy (lower is better)', secondary_y=False)
fig.update_yaxes(title_text='Hamming loss (lower is better)', secondary_y=True, range=[0, 1])
fig.update_layout(title='Train with BCE, monitor Hamming loss at threshold=0.5', height=480)
fig.show()
# Tune the probability threshold to minimize validation Hamming loss
Z_val = X_val_s @ W + b
P_val = sigmoid(Z_val)

thresholds = np.linspace(0.05, 0.95, 91)
hl_vals = []
for t in thresholds:
    Y_hat_val = (P_val >= t).astype(int)
    hl_vals.append(hamming_loss_np(Y_val, Y_hat_val))

hl_vals = np.array(hl_vals)
best_idx = int(np.argmin(hl_vals))
best_t = float(thresholds[best_idx])

t05_idx = int(np.where(np.isclose(thresholds, 0.5))[0][0])
hl_at_05 = float(hl_vals[t05_idx])
hl_best = float(hl_vals[best_idx])

print(f'Validation HL at t=0.50: {hl_at_05:.4f}')
print(f'Best threshold t*:      {best_t:.2f}')
print(f'Validation HL at t*:    {hl_best:.4f}')
Validation HL at t=0.50: 0.1437
Best threshold t*:      0.51
Validation HL at t*:    0.1420
fig = px.line(
    x=thresholds,
    y=hl_vals,
    title='Validation Hamming loss vs threshold',
    labels={'x': 'threshold t', 'y': 'Hamming loss'},
)
fig.add_vline(x=0.5, line_dash='dash', line_color='gray', annotation_text='t=0.5')
fig.add_vline(x=best_t, line_dash='dash', line_color='green', annotation_text='best t*')
fig.update_yaxes(range=[0, 1])
fig.show()
# Optional: per-label threshold tuning (can reduce HL when base rates differ)
per_label_thresholds = np.zeros(n_labels)

for j in range(n_labels):
    errs = []
    for t in thresholds:
        pred_j = (P_val[:, j] >= t).astype(int)
        errs.append(float(np.mean(pred_j != Y_val[:, j])))
    per_label_thresholds[j] = thresholds[int(np.argmin(errs))]

Y_hat_per_label = (P_val >= per_label_thresholds).astype(int)
hl_per_label = hamming_loss_np(Y_val, Y_hat_per_label)

print('Per-label thresholds:', np.round(per_label_thresholds, 2))
print('Validation HL (single t*) :', hl_best)
print('Validation HL (per-label) :', hl_per_label)

fig = px.bar(
    x=[f'label_{j}' for j in range(n_labels)],
    y=per_label_thresholds,
    title='Per-label thresholds that minimize per-label error',
    labels={'x': 'label', 'y': 'best threshold'},
)
fig.update_yaxes(range=[0, 1])
fig.show()
Per-label thresholds: [0.48 0.68 0.46 0.51 0.46 0.5 ]
Validation HL (single t*) : 0.14201388888888886
Validation HL (per-label) : 0.13819444444444445

5) Pros, cons, pitfalls#

Pros#

  • Simple and interpretable: “fraction of wrong labels.”

  • Works naturally for multilabel: does not require perfect set matches.

  • Label-wise averaging: each label decision contributes equally (micro view over all bits).

  • Comparable across models when the label space is fixed (same \(L\)).

Cons / caveats#

  • Can look deceptively good on sparse multilabel problems: if most labels are 0, predicting all zeros yields many true negatives and a low Hamming loss.

  • Does not capture set quality: predicting a wrong combination can still have a small Hamming loss if only a few bits differ.

  • Not differentiable: not suitable as a direct gradient-based training objective; use a surrogate loss and treat Hamming loss as an evaluation metric.

  • Representation matters for multiclass: integer labels vs one-hot produce different scales.
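The sparsity caveat above is easy to demonstrate on synthetic data (the ~5% label prevalence is an arbitrary choice for illustration): the all-zeros baseline earns a low Hamming loss while recalling nothing.

```python
import numpy as np
from sklearn.metrics import hamming_loss, recall_score

rng = np.random.default_rng(0)
# Sparse multilabel targets: each label is positive ~5% of the time.
Y_true = (rng.random(size=(1000, 20)) < 0.05).astype(int)
Y_zeros = np.zeros_like(Y_true)  # predict "no labels" for every sample

print(f'Hamming loss: {hamming_loss(Y_true, Y_zeros):.3f}')  # ~0.05, looks good
print(f'Micro recall: {recall_score(Y_true, Y_zeros, average="micro")}')  # 0.0
```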

Common pitfalls#

  • Passing probabilities instead of hard labels (threshold them first).

  • Using one-hot for multiclass and interpreting the value as misclassification rate.

  • Relying on Hamming loss alone with heavy class imbalance; complement with per-label precision/recall/F1, Jaccard score, or subset accuracy.

Where it’s a good fit#

  • Multilabel tagging where each label decision matters roughly equally (e.g. topic tags, attribute prediction).

  • Problems where you want a single number that reflects “average per-label error rate,” not strict exact matches.

Exercises#

  • Show algebraically that for multilabel indicators, Hamming loss = 1 - micro-accuracy.

  • Construct a sparse multilabel dataset where predicting all zeros achieves a low Hamming loss but terrible recall.

  • Implement a per-label F1 score and compare its behavior to Hamming loss under imbalance.

References#

  • scikit-learn docs: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.hamming_loss.html

  • Hamming distance (background): https://en.wikipedia.org/wiki/Hamming_distance